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Abstract 


Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its 
fast runtime on large-scale data sets. However, a well known major failing of MFVB is that it 
underestimates the uncertainty of model variables (sometimes severely) and provides no infor¬ 
mation about model variable covariance. We generalize linear response methods from statistical 
physics to deliver accurate uncertainty estimates for model variables—both for individual vari¬ 
ables and coherently across variables. We call our method linear response variational Bayes 
(LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a 
simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about 
the form of the true posterior. We demonstrate the accuracy and scalability of our method on a 
range of models for both simulated and real data. 


1 Introduction 


With increasingly efficient data collection methods, scientists are interested in quickly analyzing 
ever larger data sets. In particular, the promise of these large data sets is not simply to fit old models 
but instead to learn more nuanced patterns from data than has been possible in the past. In theory, 
the Bayesian paradigm yields exactly these desiderata. Hierarchical modeling allows practitioners 
to capture complex relationships between variables of interest. Moreover, Bayesian analysis allows 
practitioners to quantify the uncertainty in any model estimates—and to do so coherently across all 
of the model variables. 

Mean field variational Bayes (MFVB), a method for approximating a Bayesian posterior distri¬ 
bution, has grown in popularity due to its fast runtime on large-scale data sets OH[6). But a well 
known major failing of MFVB is that it gives underestimates of the uncertainty of model variables 
that can be arbitrarily bad, even when approximating a simple multivariate Gaussian distribution 
EHHEq). Also, MFVB provides no information about how the uncertainties in different model 
variables interact E El EH ED . 

By generalizing linear response methods from statistical physics IEHI3II9) to exponential fam¬ 
ily variational posteriors, we develop a methodology that augments MFVB to deliver accurate uncer- 
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tainty estimates for model variables—both for individual variables and coherently across variables. 
In particular, as we elaborate in Section [2j when the approximating posterior in MFVB is in the ex¬ 
ponential family, MFVB defines a fixed-point equation in the means of the approximating posterior, 
and our approach yields a covariance estimate by perturbing this fixed point. We call our method 
linear response variational Bayes (LRVB). 

We provide a simple, intuitive formula for calculating the linear response correction by solving 
a linear system based on the MFVB solution (Section |Z2| ). We show how the sparsity of this system 
for many common statistical models may be exploited for scalable computation (Section [23}. We 
demonstrate the wide applicability of LRVB by working through a diverse set of models to show that 
the LRVB covariance estimates are nearly identical to those produced by a Markov Chain Monte 
Carlo (MCMC) sampler, even when MFVB variance is dramatically underestimated (Section]?}. 
Finally, we focus in more depth on models for finite mixtures of multivariate Gaussians (Section|33 ), 
which have historically been a sticking point for MFVB covariance estimates O 20j. We show that 
LRVB can give accurate covariance estimates orders of magnitude faster than MCMC (Section [33} . 
We demonstrate both theoretically and empirically that, for this Gaussian mixture model, LRVB 
scales linearly in the number of data points and approximately cubically in the dimension of the 
parameter space (Section [3~4] ). 

Previous Work. Linear response methods originated in the statistical physics literature mmm 
19]. These methods have been applied to find new learning algorithms for Boltzmann machines f8t 
covariance estimates for discrete factor graphs l24l . and independent component analysis ED]. lfl8ll 
states that linear response methods could be applied to general exponential family models but works 
out details only for Boltzmann machines. fT4ll . which is closest in spirit to the present work, derives 
general linear response corrections to variational approximations; indeed, the authors go further to 
formulate linear response as the first term in a functional Taylor expansion to calculate full pairwise 
joint marginals. However, it may not be obvious to the practitioner how to apply the general formulas 
of ESI- Our contributions in the present work are (1) the provision of concrete, straightforward 
formulas for covariance correction that are fast and easy to compute, (2) demonstrations of the 
success of our method on a wide range of new models, and (3) an accompanying suite of code 


2 Linear response covariance estimation 

2.1 Variational Inference 

Suppose we observe N data points, denoted by the TV-long column vector x, and denote our un¬ 
observed model parameters by 0. Here, 0 is a column vector residing in some space 0; it has J 
subgroups and total dimension D. Our model is specified by a distribution of the observed data 
given the model parameters—the likelihood p{x\6) —and a prior distributional belief on the model 
parameters p(9). Bayes’ Theorem yields the posterior p{0\x). 

Mean-field variational Bayes (MFVB) approximates p(6\x) by a factorized distribution of the 
form q(Q) = Ylj=i Q(0j)- Q * s chosen so that the Kullback-Liebler divergence KL(q\\p) between 
q and p is minimized. Equivalently, q is chosen so that E := L + S, for L := E q [\ogp(6\x)] 
(the expected log posterior) and S := —E g [log q(0)\ (the entropy of the variational distribution), is 
maximized: 

g* := argminKL(g||p) = argminE g [log q{6) — logp(6\x)] = argmaxE 1 . (1) 

q q q 
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Up to a constant in 0 , the objective E is sometimes called the “evidence lower bound”, or the ELBO 
0 . In what follows, we further assume that our variational distribution, q ( 6 ), is in the exponential 
family with natural parameter r] and log partition function A: log q(0\rj) = p T 0 — A ( 77 ) (expressed 
with respect to some base measure in 0). We assume that p (6\x) is expressed with respect to the 
same base measure in 0 as for q. Below, we will make only mild regularity assumptions about the 
true posterior p{6 \x) and no assumptions about its form. 

If we assume additionally that the parameters 77 * at the optimum q*(6) = q(0\p*) are in the 
interior of the feasible space, then q(9\rj) may instead be described by the mean parameterization: 
777 := E q 0 with m* := E q * 0 . Thus, the objective E can be expressed as a function of m, and the 
first-order condition for the optimality of q* becomes the fixed point equation 


dE 

dm 


= 0 O 


as 

dm 


m 


dE 

= m* M (777*) = 777* for M(m) := —-b m. (2) 

dm 


2.2 Linear Response 

Let V denote the covariance matrix of 0 under the variational distribution q*(0), and let E denote 
the covariance matrix of 0 under the true posterior, p(6\x)\ 

V := Co v q *0, E := Co \ p 6. 

In MFVB, V may be a poor estimator of E, even when 777* « E p 6, i.e., when the marginal estimated 
means match well Emilia. Our goal is to use the MFVB solution and linear response methods 
to construct an improved estimator for E. We will focus on the covariance of the natural sufficient 
statistic 0 , though the covariance of functions of 6 can be estimated similarly (see Appendix |A|). 

The essential idea of linear response is to perturb the first-order condition M (777*) = 777* around 
its optimum. In particular, define the distribution p t (6\x) as a log-linear perturbation of the poste¬ 
rior: 


\ogp t {0\x) := \ogp(6\x)+t T 6 - C (t), (3) 

where C ( t ) is a constant in 6. We assume that p t (0\x) is a well-defined distribution for any t in an 
open ball around 0 . Since C ( t ) normalizes p t (0\x), it is in fact the cumulant-generating function 
of p(6\x), so the derivatives of C (t) evaluated at t = 0 give the cumulants of 6. To see why this 
perturbation may be useful, recall that the second cumulant of a distribution is the covariance matrix, 
our desired estimand: 


E = Co y p (0) = 


dt T dt 


C(t) 


7=0 




7=0 


The practical success of MFVB relies on the fact that its estimates of the mean are often good 
in practice. So we assume that m\ « E Pt 0, where m* is the mean parameter characterizing q% and 
q% is the MFVB approximation to p t . (We examine this assumption further in Section [3]) Taking 
derivatives with respect to t on both sides of this mean approximation and setting t = 0 yields 


E = Co v p (0) 


dm* t 

dt T 


=: E, 
7=0 


(4) 


where we call E the linear response variational Bayes (LRVB) estimate of the posterior covariance 
of <9. 


3 









We next show that there exists a simple formula for E. Recalling the form of the KL divergence 
(see Eq. {!])), we have that —KL(q\\p t ) = E + t T m =: E t . Then by Eq. (|2|, we have raj = M t (m*) 
for M t (m ) := M(ra) + t. It follows from the chain rule that 


draj dM t dm* t dM t dM t 

dt dm T m=m * dt + dt dm T 

where I is the identity matrix. If we assume that we are at a strict local optimum and so can invert 
the Hessian of E , then evaluating at t = 0 yields 


draj 

dt 


(5) 


draj 

dt T 


t =o 


dM 

dm 


£ +J 


o 2 e 

dmdm T 



£ +J 


E 


d 2 E 

dmdm T 


-l 


( 6 ) 


where we have used the form for M in Eq. ([ 2 ]). So the LRVB estimator E is the negative inverse 
Hessian of the optimization objective, E , as a function of the mean parameters. It follows from 
Eq.§ that E is both symmetric and positive definite when the variational distribution q* is at least 
a local maximum of E. 

We can further simplify Eq. 0 by using the exponential family form of the variational approxi¬ 
mating distribution q. For q in exponential family form as above, the negative entropy —S is dual to 
the log partition function A ll22l . so S = —iq T m A A ( 77 ); hence, 


dS dS dr] dS (dA \ dr] 

to = + s^ = (^~ m ) 


Recall that for exponential families, drj(m)/dm = V 1 . So Eq. ^ become^] 


/ d 2 L d 2 S \ 1 
V dmdm T drndrri T ) 


(H-V- 1 )- 1 , for H := 


d 2 L 

dmdm T 


E = (/ -VH)~ L V. 


(7) 


When the true posterior p(6\x) is in the exponential family and contains no products of the vari¬ 
ational moment parameters, then H = 0 and E = V. In this case, the mean field assumption is 
correct, and the LRVB and MFVB covariances coincide at the true posterior covariance. Further¬ 
more, even when the variational assumptions fail, as long as certain mean parameters are estimated 
exactly, then this formula is also exact for covariances. E.g., notably, MFVB is well-known to pro¬ 
vide arbitrarily bad estimates of the covariance of a multivariate normal posterior |2| HUES 123, but 
since MFVB estimates the means exactly, LRVB estimates the covariance exactly (see Appendix[B]). 


2.3 Scaling the matrix inverse 

Eq.0 requires the inverse of a matrix as large as the parameter dimension of the posterior p(6\x), 
which may be computationally prohibitive. Suppose we are interested in the covariance of parameter 
sub-vector a , and let z denote the remaining parameters: 0 = (a, z) T . We can partition E = 
(E a , E^). Similar partitions exist for V and H. If we assume a mean-field factorization 

q(a, z) = q(a)q(z), then V az = 0. (The variational distributions may factor further as well.) We 
calculate the Schur complement of E in Eq. ([ 7 ]) with respect to its zth component to find that 

= (la ~ V a H a - V a H az (I z - V z H z )- l V z H za f' V a . ( 8 ) 

'For a comparison of this formula with the frequentist “supplemented expectation-maximization” procedure see Ap- 
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Here, I a and I z refer to a- and z -sized identity matrices, respectively. In cases where (I z — V z H z )~ l 
can be efficiently calculated (e.g., all the experiments in Section [3} see Fig. ([5]) in Appendix |D|), 
Eq. ([8| requires only an ct-sized inverse. 


3 Experiments 


We compare the covariance estimates from LRVB and MFVB in a range of models, including models 
both with and without conjugacy^] We demonstrate the superiority of the LRVB estimate to MFVB 
in all models before focusing in on Gaussian mixture models for a more detailed scalability analysis. 

For each model, we simulate datasets with a range of parameters. In the graphs, each point 
represents the outcome from a single simulation. The horizontal axis is always the result from an 


MCMC procedure, which we take as the ground truth. As discussed in Section \L2\ the accuracy of 
the LRVB covariance for a sufficient statistic depends on the approximation raj « E Pt 6 . In the mod¬ 
els to follow, we focus on regimes of moderate dependence where this is a reasonable assumption 
for most of the parameters (see Section |L2| for an exception). Except where explicitly mentioned, 
the MFVB means of the parameters of interest coincided well with the MCMC means, so our key 
assumption in the LRVB derivations of Section [2] appears to hold. 


3.1 Normal-Poisson model 

Model. First consider a Poisson generalized linear mixed model, exhibiting non-conjugacy. We 
observe Poisson draws y n and a design vector x n , for n = 1,..., N. Implicitly below, we will 
everywhere condition on the x n , which we consider to be a fixed design matrix. The generative 
model is: 


Z n \(3,T m ~ P J\T (z n \/3x n ,T *) , y n \z n P Poisson (y n \ exp(;z n )), (9) 

/3 - V(/?|0,oj), r ~ r(r|o; T , /? T ). 

For MFVB, we factorize q (/3, r,z) = q (/3) q (r) Yln=i Q ( z n )• Inspection reveals that the optimal 
q (/ 3 ) will be Gaussian, and the optimal q (r) will be gamma (see Appendix |d|). Since the optimal 
q(z n ) does not take a standard exponential family form, we restrict further to Gaussian q(z n ). There 
are product terms in L (for example, the term E q [r] E q [/?] E g [z n ]) 9 so H ^ 0, and the mean field 
approximation does not hold; we expect LRVB to improve on the MFVB covariance estimate. A 
detailed description of how to calculate the LRVB estimate can be found in Appendix |D| 

Results. We simulated 100 datasets, each with 500 data points and a randomly chosen value for 
fj, and r. We drew the design matrix x from a normal distribution and held it fixed throughout. We 
set prior hyperparameters cr| = 10, a T = 1, and /3 r = 1. To get the “ground truth” covariance 
matrix, we took 20000 draws from the posterior with the R MCMCglmm package 13, which used 
a combination of Gibbs and Metropolis Hastings sampling. Our LRVB estimates used the autodif¬ 
ferentiation software JuMP m. 

Results are shown in Fig. 0- Since r is high in many of the simulations, z and /? are correlated, 
and MFVB underestimates the standard deviation of /? and r. LRVB matches the MCMC standard 
deviation for all /?, and matches for r in all but the most correlated simulations. When r gets very 
high, the MFVB assumption starts to bias the point estimates of r, and the LRVB standard deviations 

2 All the code is available on our Github repository, rgiordan/LinearResponseVariationalBayesNIPS2015 
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start to differ from MCMC. Even in that case, however, the LRVB standard deviations are much more 
accurate than the MFVB estimates, which underestimate the uncertainty dramatically. The final plot 
shows that LRVB estimates the covariances of z with /?, r, and log r reasonably well, while MFVB 
considers them independent. 


p mean 



MCMC mean 


p sd 



0.00 0.02 0.04 0.06 
MCMC std dev 


x mean 



MCMC mean 


x sd cov with z 

m i ::r/ 


MCMC std dev 


- 0.2 0.0 0.2 
MCMC cov 


Method: 

• Irvb 

• mfvb 


Figure 1: Posterior mean and covariance estimates on normal-Poisson simulation data. 


3.2 Linear random effects 

Model. Next, we consider a simple random slope linear model, with full details in Appendix[E] We 
observe scalars y n and r n and a vector x n , for n = 1 ,..., N. Implicitly below, we will everywhere 
condition on all the x n and r n , which we consider to be fixed design matrices. In general, each 
random effect may appear in multiple observations, and the index k(n) indicates which random 
effect, Zk, affects which observation, y n . The full generative model is: 

Vn\P,z,T *"~ p Af(y n \/3 T x n +r n z k{n) ,T~ 1 ), z k \v Af (z k \0, v ~ l ), 

P ~ AT(f3\0, £3), v ~ T{v\a v ,fi v ), r ~ r(r|a T ,/? T ). 

We assume the mean-field factorization q (/?, z/, r,z) = q (/?) q (r) q (v) Y\^ =1 Q ( z n )• Since this is 
a conjugate model, the optimal q will be in the exponential family with no additional assumptions. 


Results. We simulated 100 datasets of 300 datapoints each and 30 distinct random effects. We 
set prior hyperparameters to ol v = 2, /3 U = 2, a T = 2 , /3 r = 2, and = 0.1 -1 /. Our x n was 
2-dimensional. As in Section 3.1 we implemented the variational solution using the autodifferenti- 
ation software JuMP [ 1()J|. The MCMC fit was performed with using MCMCglmm 0. 

Intuitively, when the random effect explanatory variables r n are highly correlated with the fixed 
effects x n , then the posteriors for z and /3 will also be correlated, leading to a violation of the 
mean field assumption and an underestimated MFVB covariance. In our simulation, we used r n = 
x\ n + AT(0,0.4), so that r n is correlated with x\ n but not x^n • The result, as seen in Fig. ([ 2 ]), 
is that (3i is underestimated by MFVB, but /?2 is not. The v parameter, in contrast, is not well- 
estimated by the MFVB approximation in many of the simulations. Since the LRVB depends on the 
approximation m* t ~ E Pt /9, its LRVB covariance is not accurate either (Fig. ([2])). However, LRVB 
still improves on the MFVB standard deviation. 


3.3 Mixture of normals 

Model. Mixture models constitute some of the most popular models for MFVB application 0 0] 
and are often used as an example of where MFVB covariance estimates may go awry [ElEOJ. Thus, 
we will consider in detail a Gaussian mixture model (GMM) consisting of a iC-component mixture 
of P-dimensional multivariate normals with unknown component means, covariances, and weights. 
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Figure 2: Posterior mean and covariance estimates on linear random effects simulation data. 


In what follows, the weight 7^ is the probability of the kth component, fi^ is the P-dimensional 
mean of the kth component, and is the P x P precision matrix of the kth component (so A^ 1 is 
the covariance parameter). N is the number of data points, and x n is the nth observed P-dimensional 
data point. We employ the standard trick of augmenting the data generating process with the latent 
indicator variables z n k , for n = 1, ...,7V and k = 1,...,P, such that z n k = 1 implies x n ~ 
A^ 1 ). S° generative model is: 

P{z n k = 1) = TTfe, p(x\n, n, A, z) = JJ P[ N(x n \pk, K 1 ) Znk (10) 

n=l:N k=l:K 

We used diffuse conditionally conjugate priors (see Appendix[F]for details). We make the variational 
assumption q (/i, 7r, A, z) = Y\k=i Q (pO Q (A&) Q i^k) Yln=i Q ( z n )• We compare the accuracy and 
speed of our estimates to Gibbs sampling on the augmented model (Eq. ( fTO} ) using the function 
rnmixGibbs from the R package bayesm. We implemented LRVB in C++, making extensive use 
of RcppEigen m We evaluate our results both on simulated data and on the MNIST data set 0. 

Results. For simulations, we generated N = 10000 data points from K = 2 multivariate normal 
components in P = 2 dimensions. MFVB is expected to underestimate the marginal variance of /i, 
A, and log(7r) when the components overlap since that induces correlation in the posteriors due to 
the uncertain classification of points between the clusters. We check the covariances estimated with 
Eq. ^ against a Gibbs sampler, which we treat as the ground truth]^] 

We performed 198 simulations, each of which had at least 500 effective Gibbs samples in each 
variable—calculated with the R tool effectiveSize from the coda package ITbl . The first three plots 
show the diagonal standard deviations, and the third plot shows the off-diagonal covariances. Note 
that the off-diagonal covariance plot excludes the MFVB estimates since most of the values are 
zero. Fig. ([3]) shows that the raw MFVB covariance estimates are often quite different from the 
Gibbs sampler results, while the LRVB estimates match the Gibbs sampler closely. 

For a real-world example, we fit a K = 2 GMM to the N = 12665 instances of handwritten 0s 
and Is in the MNIST data set. We used PC A to reduce the pixel intensities to P = 25 dimensions. 
Full details are provided in Appendix [G] In this MNIST analysis, the A standard deviations were 
under-estimated by MFVB but correctly estimated by LRVB (Fig. ([3])); the other parameter standard 
deviations were estimated correctly by both and are not shown. 

3 The likelihood described in Section |33| is symmetric under relabeling. When the component locations and shapes have a 
real-life interpretation, the researcher is generally interested in the uncertainty of /i, A, and i r for a particular labeling, not the 
marginal uncertainty over all possible re-labelings. This poses a problem for standard MCMC methods, and we restrict our 
simulations to regimes where label switching did not occur in our Gibbs sampler. The MFVB solution conveniently avoids 
this problem since the mean held assumption prevents it from representing more than one mode of the joint posterior. 
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Figure 3: Posterior mean and covariance estimates on GMM simulation and MNIST data. 


3.4 Scaling experiments 

We here explore the computational scaling of LRVB in more depth for the finite Gaussian mixture 
model (Section |33] ). In the terms of Section[23] a includes the sufficient statistics from /i, 7 r, and A, 
and grows as 0(KP 2 ). The sufficient statistics for the variational posterior of /1 contain the P-length 
vectors pk, for each k , and the (P + l)P/2 second-order products in the covariance matrix Pkl^k- 
Similarly, for each k , the variational posterior of A involves the (P + l)P/2 sufficient statistics 
in the symmetric matrix A& as well as the term log IA& |. The sufficient statistics for the posterior 
of 7Tk are the K terms log^/c^] So, minimally, Eq. ([ 7 ]) will require the inverse of a matrix of size 
0(KP 2 ). The sufficient statistics for z have dimension K x N. Though the number of parameters 
thus grows with the number of data points, H z — 0 for the multivariate normal (see Appendix [F]), 
so we can apply Eq. § to replace the inverse of an O (KN )-sized matrix with multiplication by 
the same matrix. Since a matrix inverse is cubic in the size of the matrix, the worst-case scaling for 
LRVB is then 0(K 2 5 ) in K , 0(P 6 ) in P, and O(N) in N. 

In our simulations (Fig. Q) we can see that, in practice, LRVB scales linearl)0in N and approx¬ 
imately cubically in P across the dimensions considered]^] The P scaling is presumably better than 
the theoretical worst case of 0(P 6 ) due to extra efficiency in the numerical linear algebra. Note that 
the vertical axis of the leftmost plot is on the log scale. At all the values of N, K and P considered 
here, LRVB was at least as fast as Gibbs sampling and often orders of magnitude faster. 


4 Conclusion 

The lack of accurate covariance estimates from the widely used mean-field variational Bayes (MFVB) 
methodology has been a longstanding shortcoming of MFVB. We have demonstrated that in sparse 
models, our method, linear response variational Bayes (LRVB), can correct MFVB to deliver these 
covariance estimates in time that scales linearly with the number of data points. Furthermore, we 
provide an easy-to-use formula for applying LRVB to a wide range of inference problems. Our 
experiments on a diverse set of models have demonstrated the efficacy of LRVB, and our detailed 

4 Since 77 k — L using K sufficient statistics involves one redundant parameter. However, this does not violate any 

of the necessary assumptions for Eq. {?}, and it considerably simplifies the calculations. Note that though the perturbation 
argument of Section [^requires the parameters of p(6\x) to be in the interior of the feasible space, it does not require that the 
parameters of p(x\9)be interior. 

5 The Gibbs sampling time was linearly rescaled to the amount of time necessary to achieve 1000 effective samples in the 
slowest-mixing component of any parameter. Interestingly, this rescaling leads to increasing efficiency in the Gibbs sampling 
at low P due to improved mixing, though the benefits cease to accrue at moderate dimensions. 

6 For numeric stability we started the optimization procedures for MFVB at the true values, so the time to compute the 
optimum in our simulations was very fast and not representative of practice. On real data, the optimization time will depend 
on the quality of the starting point. Consequently, the times shown for FRVB are only the times to compute the FRVB 
estimate. The optimization times were on the same order. 
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Figure 4: Scaling of LRVB and Gibbs on simulation data in both log and linear scales. Before taking 
logs, the line in the two lefthand (N) graphs is y oc x, and in the righthand (P) graph, it is y oc x 3 . 


study of scaling of mixtures of multivariate Gaussians shows that LRVB can be considerably faster 
than traditional MCMC methods. We hope that in future work our results can be extended to more 
complex models, including Bayesian nonparametric models, where MFVB has proven its practical 
success. 

Acknowledgments. The authors thank Alex Blocker for helpful comments. R. Giordano and 
T. Broderick were funded by Berkeley Fellowships. 
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Appendices 

You can find this paper, as well as all the code necessary to run the described experiments, in our 
Github repo, rgiordan/LinearResponseVariationalBayesNIPS2015, 


A LRVB estimates of the covariance of functions 

In Section [Z2] we derived an estimate of the covariance of the natural sufficient statistics, 6, of our 
variational approximation, q(0). In this section we derive a version of Eq. 0 for the covariance of 
functions of 0. 

We begin by estimating the covariance between 0 and a function (f)(0). Suppose we have an 
MFVB solution, q(0), to Eq. |T]). Define the expectation of (f)(0) to be E q [(f)(0)] := f(m). This 
expectation is function of m alone since m completely parameterizes q. As in Eq. ([3]), we can 
consider a perturbed log likelihood that also includes / (m ): 

\ogp t (0\x) = \ogp + to + tff (to) := logp + t T m,f 

1 '= ( X ) 

Using the same reasoning that led to Eq. ([4]), we will define 

dm * /v 

Xet = Co v p {p,<j>{0)) 


We then have the following lemma: 

Lemma A.l. If E q [(f)(0)] =: f(m) is a differentiable function ofm with gradient V/, then 


^ = sv/ 

Proof The derivative of the perturbed ELBO, E t , is given by: 

E t := E + tTfrif 

— = — + ( I V/ ) ( to 
dm dm { J ’ V tf 


The fixed point Eq. ^ then gives: 


M t (m) := M(m) + (I V/ ) ^ ^ *) 


dm+ 


dm+ 





dm* 

dt T 


+ (I V/) 


11 












The term ( I V/ ) 


to 

tf 


dm+ 

~dF 


dm* 

dt T 


is awkward, but it disappears when we evaluate at t = 0 , giving 
f DM \ dm* 


+(I V/) 


l dm T _ # J 
( d 2 E \ dm* . T . 

V^^T +/ j^ + V/ 

\ dmdm T J ' 


I V/ 


Recalling that 


We can plug in to see that 


dm* 


:= E 


y dm - 

SV/ 


(ii) 
□ 

Finally, suppose we are interested in estimating Cov p (7(0), 0(0)), where g(m) := E g [ 7 ( 0 )]. 
Again using the same reasoning that led to Eq. Q, we will define 

= Cov p ( 7 (0), 0(0)) « dE ^ (6l)] =: 7^ 

Proposition A.2. 7jfE g [0(0)] = f(m) and¥, q [ 7 ( 0 )] = g(m) are differentiable functions of m with 
gradients V/ Vg respectively, then 

e 70 = v 5 t sv/ 


Proof By Lemma A. 1 an application of the chain rule, 


— 


<*9 [7(0)] dg ( to ) 


dtf 


dtf 


dg{m) dm 
dm T dtf 


= Vg T ZVf 


□ 


B Exactness of LRVB for multivariate normal means 

For any target distribution p{6\x), it is well-known that MFVB cannot be used to estimate the co- 
variances between the components of 0. In particular, if q* is the estimate of p(0\x) returned by 
MFVB, q* will have a block-diagonal covariance matrix—no matter the form of the covariance of 

p(9\x). 

Consider approximating a multivariate Gaussian posterior distribution p(6\x) with MFVB. The 
Gaussian is the unique distribution that is fully determined by its mean and covariance. This poste¬ 
rior arises, for instance, given a multivariate normal likelihood p(x\p) = rin=i iv N(x n \p,, S) with 
fixed covariance S and an improper uniform prior on the mean parameter g. We make the mean 
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field factorization assumption q(p) = Y\ d=1 . D q{pd), where D is the total dimension of p. This 
fact is often used to illustrate the shortcomings of MFVB (H 20,123]. In this case, it is well known 
that the MFVB posterior means are correct, but the marginal variances are underestimated if S is 
not diagonal. However, since the posterior means are correctly estimated, the LRVB approximation 
in Eq. (|7| is in fact an equality. That is, for this model, E = dm t /dt T = E exactly. 

In order to prove this result, we will rely on the following lemma. 

Lemma B.l. Consider a target posterior distribution characterized by p(9\x) = N(0 |/i, E), where 
p and E may depend on x, and E is invertible. Let 0 = (#i,..., Oj), and consider a MFVB 
approximation to p(0\x) that factorizes as q(0) = Ylj q{0j)- Then the variational posterior means 
are the true posterior means; i.e. rrij = pj for all j between 1 and J. 

Proof The derivation of MFVB for the multivariate normal can be found in Section 10.1.2 of (2); 
we highlight some key results here. Let A = E -1 . Let the j index on a row or column correspond 
to Oj, and let the — j index correspond to {0i : i G [J] \ j}. E.g., for j = 1, 


An Ai_i 
A-i,i A-i,-i 


By the assumption that p(0\x) = J\T(6\p, E), we have 


logpiOjlOi^j^j.x) 

= — - {Oj — pj) A jj(6j — pj) + (Oj — pj) A j,-j(0-j — p~j) + C, (12) 

where the final term is constant with respect to Oj . It follows that 


log q*(0j) = ^ q * :i e[J]\j log p(0,x) + C 

= _ ^33®3 + ®3l 1 3^33 ~ ^3^h~3^Q*^~ 




So 

q*{0j)=N{0j\mj,^f 

with mean parameters 

171 j = @3 Fj ~ A jj Aj : -j(rri-j — P-j) (13) 

as well as an equation for E q * 0 T 0. 

Note that A jj must be invertible, for if it were not, E would not be invertible. 

The solution m — p is a unique stable point for Eq. ( [13] ), since the fixed point equations for each 
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j can be stacked and rearranged to give 

0 iV n iV 12 


m — fi = 


A-i^A^ ••• A 11 1 A 1 (j_ 1 ) A 11 1 Au 


Aj}Aji AjjAj 2 ••• A J ]Aj(j_ 1) 
0 


An 1 


A7i 


Aji A j 2 


0 = 


A 


li 


0 ••• 0 
0 Ai2 


Aji A j 2 
0 = A {m — p) 

m = fi. 


(m - p) + 


Ajj 


Ai(j-i) Au 


A 


j(j-i) 


(m — p) 


(m — p) 


0 Ai 2 • • • Ai (j-i) A 


i(j-i) 


A 


j(j-i) 


(m — /i) O 


The last step follows from the assumption that E (and hence A) is invertible. It follows that fi is the 
unique stable point of Eq. ( [13] ). 

□ 


Proposition B.2. Assume we are in the setting of Lemma \B.1\ where additionally /i and E are on 
the interior of the feasible parameter space. Then the LRVB covariance estimate exactly captures 
the true covariance, E = E. 


Proof Consider the perturbation for LRVB defined in Eq. <0- By perturbing the log likelihood, 
we change both the true means p t and the variational solutions, m t . The result is a valid density 
function since the original p and E are on the interior of the parameter space. By Lemma |BT| the 
MFVB solutions are exactly the true means, so m t j = pt,j , and the derivatives are the same as well. 
This means that the first term in Eq. ([7]) is not approximate, i.e. 


dm t 

dt T 


- dt T E P t d -^ 


tl 


It follows from the arguments above that the LRVB covariance matrix is exact, and E = E. 

□ 


C Comparison with supplemented expectation-maximization 

The result in Appendix [B] about the multivariate normal distribution draws a connection between 
LRVB corrections and the “supplemented expectation-maximization” (SEM) method of 02. SEM 
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is an asymptotically exact covariance correction for the EM algorithm that transforms the full-data 
Fisher information matrix into the observed-data Fisher information matrix using a correction that 
is formally similar to Eq. 0- In this section, we argue that this similarity is not a coincidence; in 
fact the SEM correction is an asymptotic version of LRVB with two variational blocks, one for the 
missing data and one for the unknown parameters. 

Although LRVB as described here requires a prior (unlike SEM, which supplements the MLE), 
the two covariance corrections coincide when the full information likelihood is approximately log 
quadratic and proportional to the posterior, p(0\x). This might be expected to occur when we have 
a large number of independent data points informing each parameter—i.e., when a central limit 
theorem applies and the priors do not affect the posterior. In the full information likelihood, some 
terms may be viewed as missing data, whereas in the Bayesian model the same terms may be viewed 
as latent parameters, but this does not prevent us from formally comparing the two methods. 

We can draw a term-by-term analogy with the equations in nm We denote variables from the 
SEM paper with a superscript “SEM” to avoid confusion. MFVB does not differentiate between 
missing data and parameters to be estimated, so our 6 corresponds to (0 SEM , Y^ EM ) in (121 . SEM 
is an asymptotic theory, so we may assume that (0 SEM : Y^ EM ) have a multivariate normal distri¬ 
bution, and that we are interested in the mean and covariance of Q SEM . 

In the E-step of H2l . we replace Y^ EM with its conditional expectation given the data and other 
0 s em' Tlh s corresponds precisely to Eq. (l3| ), taking Oj = Y^ EM . In the M-step, we find the 
maximum of the log likelihood with respect to Q SEM , keeping Y^ EM fixed at its expectation. Since 
the mode of a multivariate normal distribution is also its mean, this, too, corresponds to Eq. 
now taking Oj = Q SEM . 

It follows that the MFVB and EM fixed point equations are the same; i.e., our M is the same as 
their M sem , and our dM/dm of Eq. § corresponds to the transpose of their DM sem , defined 
inEq. (2.2.1) of ITU Since the “complete information” corresponds to the variance of Q SEM with 
fixed values for Yq^IY , this is the same as our S g * 5 n, the variational covariance, whose inverse is 
I~}. Taken all together, this means that equation (2.4.6) of lfl2l can be re-written as our Eq. ([7]). 


ySEM =/ -l (J _ DM SEMj 

M-m T ) 


-i 



dM \ _1 
dm T ) 


V 


D Normal-Poisson details 


In this section, we use this model to provide a detailed, step-by-step description of a simple LRVB 
analysis. 

The full joint distribution for the model in Eq. ([9]) is 


log p(y,z,0,r) 


iV ^ 1 ^ 

Y y2 TZ ^ +XnT/3Zn ~ 2 X ™ T ^ 2 

n= 1 ' w 



+ V (- exp (z n ) + Z n y n ) - —2 /3 2 + (a T - 1) logr - (5 t t + C 
n = 1 za 0 


We find a mean-field approximation under the factorization q (/3, r,z) = q (/?) q (r) n^Li Q ( z n )• 
By inspection, the log joint is quadratic in /?, so the optimal q (/?) will be Gaussian (2). Similarly, 


15 




the log joint is a function of r only via r and log r, so the optimal q (r) will be gamma. However, 
the joint does not take a standard exponential family form in z n : 

log p (z n \y, /?, T ) = (x n r/3 + y„) z n - - exp (z n ) + C 

The difficulty is with the term exp ( z n ). So we make the further restriction that 

q (z n ) = N (•) = q (z n ; E [z n \ , E [z 2 n ]) . 

Fortunately, the troublesome term has an analytic expectation, as a function of the mean parameters, 
under this variational posterior: 

E 9 [exp (z n )] = exp ^E ? [z n \ + 1 (e 9 [z%\ - E g [ 2 n ] 2 )^ . 

We can now write the variational distribution in terms of the following mean parameters: 

TO = (Eg [/?] , Eg [/ 6 2 ] , Eg [r] ,Eg [bg T] , Eg [z^ ,Eg [zf] , . . . , Eg [z N ] , Eg [^ ] ) ^ • 

Calculating the LRVB covariance consists of roughly four steps: 


1. finding the MFVB optimum q*, 

2. computing the covariance V of g*, 

3. computing H , the Hessian of L(m), for q*, and 

4. computing the matrix inverse and solving (/ — VH) -1 V. 


For step (1), the LRVB correction is agnostic as to how the optimum is found. In our exper¬ 
iments below, we follow a standard coordinate ascent procedure for MFVB (2|. We analytically 
update q (/3) and q (r). Given q (/3) and q (r), finding the optimal q (z) becomes N separate two- 
dimensional optimization problems; there is one dimension for each of the mean parameters E q [z n \ 
and [z%] • In our examples, we solved these problems sequentially using IPOPT ETI . 

To compute V for step (2), we note that by the mean-field assumption, /?, r, and z n are indepen¬ 
dent, so V is block diagonal. Since we have chosen convenient variational distributions, the mean 
parameters have known covariance matrices. For example, from standard properties of the normal 


distribution, Cov (/?, /? 2 ) = 2E g \/3\ (e, [f3 2 ] - Eg [/?] 2 ). 


For step (3), the mean parameters for /3 and r co-occur with each other and with all the z n , so 
these four rows of H are expected to be dense. However, the mean parameters for z n never occur 
with each other, so the bulk of H — the 2 N x 2 N block corresponding to the mean parameters of 
z—will be block diagonal (Fig. ([5b])). The Hessian of L (m) can be calculated analytically, but we 
used the autodifferentiation software JuMP Co|. 

Finally, for step (4), we use the technique in Section 2.3 to exploit the sparsity of V and H 
(Fig. (5c)) in calculating (/ — VH) -1 . 
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(a) MFVB covariance V (b) Hessian matrix H (c) (/ — VH) 

Figure 5: Sparsity patterns for E = (I — VH)- 1 using the model in Eq. d9b, n = 5 (white = 0) 


E Random effects model details 

As introduced in Section [331 our model is: 

Vn \P,z,T m ~ ep M{P T x n + r n Z k ( n ),T~ 1 ) 
Zk \v Af ( 0, i' -1 ) 

With the priors: 

P ~ N (0,^/3) 
v ~ r (a„, p u ) 
t ~ r( a j T ) 

We will make the following mean field assumption: 


K 

q(f3,z,T,p) = q(v)q(T)q(P)Y[q(z k ) 

k=1 


We have n E {1,TV}, and k G {1,iV}, and k (n) matches an observation n to a random effect 
k, allowing repeated observations of a random effect. The full joint log likelihood is: 


log p{y n \z k (n),T, P) 

logp { z k\v) 

log p(P) 
log p(t) 
log p{v) 

log p(y,T,P,z) 


(Vn ~ P T X n ~ r n z k(n )) 2 + \ log T + C 

V O 1 . ^ 

-^i + 2 logv + c 

— ^trace ( '^ l PP T ) + C 

(a T - 1) log t - P t t + C 
(a v - 1) log v - p v v + C 

N K 

^2 log P (: Vn\z k (n),T, P) + ^ log P (. Z k \v ) + 

n=1 k=1 

log p(P) + logp(v) + log p(t) 


Expanding the first term of the conditional likelihood of y n gives 

-Vvn- P T X n ~ r n Z k (n)) 2 

= ~\\Vn- 2 VnXnP - 2y n r n z n{k) + trace (x n x£/3/? T ) + r 2 n z 2 k{n) + 2r n x^Pz k{n) ) 
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By grouping terms, we can see that the mean parameters will be 


q(fi) = q{P;E q [/3],E q [PP T ]) 
q(z k ) = q(z k ;E q [z k ],E q [z k ]) 
q( T ) = q (r; E g [r], E g [log r]) 
q(y) = q (i/; E q [v\ , E q [log u\) 


It follows that the optimal variational distributions are q (/3) =multivariate normal, q(zk) =univariate 
normal, and q (r) and q (n) will be gamma. We performed standard coordinate ascent on these dis¬ 
tributions m. 

As in Section HU we implemented this model in the autodifferentiation software JuMP II101 . 
This means conjugate coordinate updates were easy, since the natural parameters corresponding to 
a mean parameters are the first derivatives of the log likelihood with respect to the mean parameters. 
For example, denoting the log likelihood at step s by L s , the update for q s +i ( Zk ) will be: 


log<? S +l 0/c) 


dEM , [Ls] jz , r 

dE q [z k ] k+ dE q [zl\ k + C 


Given the partial derivatives of L s with respect to the mean parameters, the updated mean parameters 
for Zk can be read off directly using standard properties of the normal distribution. 

The variational covariance matrices are all standard. We can see that H will have nonzero terms 
in general (for example, the three-way interaction E q [r] E q [^( n )] [/?]), and that LRVB will be 

different from MFVB. As usual in our models, H is sparse, and we can easily apply the technique 
in section Section [23] to get the covariance matrix excluding the random effects, z. 


F Multivariate normal mixture details 


In this section we derive the basic formulas needed to calculate Eq. for a finite mixture of nor¬ 
mals, which is the model used in Section [3] We will follow the notation introduced in Section [53] 
Let each observation, x n , be a P x 1 vector. We will denote the Pth component of the nth 
observation x n , with a similar pattern for 2 and fi. We will denote the p , qth entry in the matrix 
as A k, pq - The data generating process is as follows: 


P ( x \n, 7T, A) 


logP (x n \z n , fj,, A) 
log (t>k{x) 
log P(z nk \TT k ) 


N K 

II P(Xn\Zn,(J>, A) P(z nk \TT k ) 

n =1 k=1 

N 

^ ^ Znk T C 

n =1 

1 T 1 

A- k (x — /i k ) + - log |Afc| + C 

K 

Znk log 7 T k + C 

k=1 
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It follows that the log posterior is given by 



n= 1 k=l 


K 


K 


p, 1o sp(^) + P, lo sp( A fe)+ l °s^( 7r ) + c 


k =1 


k =i 


We used a multivariate normal prior for fi^, a Wishart prior for A&, and a Dirichlet prior for tv. In 
the simulations described in Section [373] we used the following prior parameters for the VB model: 

p(fJ-k) = V (0p,diag P (0.01) _1 ) 
p{ A fc ) = Wishart(diag P (0.01), 1) 


p( 7r) = Dirichlet (5 x) 


Here, diag P (a) is a P-dimensional diagonal matrix with a on the diagonal, and Op is a length P 
vector of the value 0, with a similar definition for 5^- Unfortunately, the function we used for the 
MCMC calculations, rnmixGibbs in the package bayesm, uses a different form for the fip prior. 
Specifically, rnmixGibbs uses the prior 


Pmcmc {Pk\ A k) = A/"(0, a 1 A k - 1 ) 


where a is a scalar. There is no way to exactly match PMCMc(Pk) to p(/i/c), so we simply set 
a = 0.01. Since our datasets are all reasonably large, the prior was dominated by the likelihood, and 
we found the results extremely insensitive to the prior on /i^, so this discrepancy is of no practical 
importance. 

The parameters A&, tv, and z n will each be given their own variational distribution. For 
q flk we will use a multivariate normal distribution; for q\ k we will us a Wishart distirbution; for q n 
we will use a Dirichlet distribution; for q Zn we will use a Multinoulli (a single multinomial draw). 
These are all the optimal variational choices given the mean field assumption and the conditional 
conjugacy in the model. 

The sufficient statistics for /i*. are all terms of the form and PkpPkq- Consequently, the 
sub-vector of 0 corresponding to fip is 


pm \ 


pkp 


PklPkl 

PklPk2 


\ PkPPkP ) 


We will only save one copy of PkpPkq and PkqPkpi so 0 kLk has length P + \ (P + 1) P. For all the 
parameters, we denote the complete stacked vector without a k subscript: 
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The sufficient statistics for are all the terms A k, pq and the term log |Afc|. Again, since A is 
symmetric, we do not keep redundant terms, so 0\ k has length lp|(Ppl)P. The sufficient 
statistic for 7 r is the P-vector (log7Ti, ...,log7 tk)- The sufficient statistics for z are simply the 
N x P values z n k themselves. 

In terms of Section [231 we have 


a = 



(«.) 


That is, we are primarily interested in the covariance of the sufficient statistics of /i, A, and 7 r. The 
latent variables 2 are nuisance parameters. 

To put the log likelihood in terms useful for LRVB, we must express it in terms of the sufficient 
statistics, taking into account the fact the 6 vector does not store redundant terms (e.g. it will only 
keep Aab for a < b since A is symmetric). 


1 T 

2 {%n /p) 


= -^trace (A fc (x n - fi k ) (x n - ^) T ) 

a b 

= 4ee {^k,abl^k,af^k,b -^k,ab^n,a^k,b ^k,ab^n,bl^k,a P ^k,ab%n,a%n,b) 

a b 

— 2 ^ y ^k,aa (/p) P ^ ^ 2 ^ ^ ^-k,aa (^p) 

a a a 

2 ^ y ^k^ab!^k : al^k : b P ^ ^ ^-k,ab^n,af^k,b 2 ^ > A-k,ab%n,a%n,b 

a^b a^b a^b 

= ^ ^ (/p) P ^ ^ ^-fe,QQ^n,a/^/c,Q ^ ^ v 


( 2 P - 


E h.k,abHk,aHk,b + yp k,ab (%n,a^k,b P 3p,&AP,a) E A-k,ab%n,a%n,b 

a<b a<b a<b 


The MFVB updates and covariances in 1/ are all given by properties of standard distributions. To 
compute the LRVB corrections, it only remains to calculate the Hessian, H. These terms can be 
read directly off the posterior. First we calculate derivatives with respect to components of /i. 


8 2 H 

^/^k,a^-^-k,ab 

d 2 h 

& {f^k,af^k,b) d-^k,ab 

8 2 h 

^l^k,a^^nk 

d 2 H 


^ ^ Znk%n,b 

1 (a=b) 

^ ^ Znk 
n 

^ ^ -^k,ab%n,b 
b 

1 (a=b) 

A/c,a6 




& {/^k,a^k,b^) &Z U k 
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All other // derivatives are zero. For A, 


dA-k,abdZnk 

d 2 H 


d 2 h 



1 


Slog I A/e I dz nk 2 

The remaining A derivatives are zero. The only nonzero second derivatives for log i r are to Z and 
are given by 


d 2 H 


Slog? v k dz nk 

Note in particular that H zz = 0, allowing efficient calculation of Eq. ([8|. 


G MNIST details 


For a real-world example, we applied LRVB to the unsupervised classification of two digits from 
the MNIST dataset of handwritten digits. We first preprocess the MNIST dataset by performing 
principle component analysis on the training data’s centered pixel intensities and keeping the top 25 
components. For evaluation, the test data is projected onto the same 25-dimensional subspace found 
using the training data. 

We then treat the problem of separating handwritten Os from Is as an unsupervised clustering 
problem. We limit the dataset to instances labeled as 0 or 1, resulting in 12665 training and 2115 
test points. We fit the training data as a mixture of multivariate Gaussians. Here, K = 2, P = 25, 
and N = 12665. Then, keeping the /i, A, and i r parameters fixed, we calculate the expectations 
of the latent variables z in Eq. © for the test set. We assign test set data point x n to whichever 
component has maximum a posteriori expectation. We count successful classifications as test set 
points that match their cluster’s majority label and errors as test set points that are different from 
their cluster’s majority label. By this measure, our test set error rate was 0.08. We stress that we 
intend only to demonstrate the feasibility of LRVB on a large, real-world dataset rather than to 
propose practical methods for modeling MNIST. 
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